Abstract

For decades, educators have relied on readability metrics that tend to oversimplify 
dimensions of text difficulty. This study examines the potential of applying advanced 
artificial intelligence methods to the educational problem of assessing text difficulty. 
The combination of hierarchical machine learning and natural language processing 
(NLP) is leveraged to predict the difficulty of practice texts used in a reading compre- 
hension intelligent tutoring system, iSTART. Human raters estimated the text difficulty 
level of 262 texts across two text sets (Set A and Set B) in the iSTART library. NLP 
tools were used to identify linguistic features predictive of text difficulty and these 
indices were submitted to both flat and hierarchical machine learning algorithms. 
Results indicated that including NLP indices and machine learning increased accuracy 
by more than 10% as compared to classic readability metrics (e.g., Flesch-Kincaid 
Grade Level). Further, hierarchical outperformed non-hierarchical (flat) machine learn- 
ing classification for Set B (72%) and the combined set A + B (65%), whereas the non- 
hierarchical approach performed slightly better than the hierarchical approach for Set A 
(79%). These findings demonstrate the importance of considering deeper features of 
language related to text difficulty as well as the potential utility of hierarchical machine 
learning approaches in the development of meaningful text difficulty classification. 
Introduction


Text remains a crucial learning tool in most classrooms today (Fuchs et al. 2014). In the 
classroom, students are often tasked to read texts and textbooks in order to learn new 
information. As such, many educators and text publishers attempt to level texts such 
that text difficulty is appropriate for the students (National Governors Association 
Center for Best Practices 2010). Readability formulas have been used for well over a 
century as a means to evaluate text difficulty. Indeed, teachers have long relied on 
readability metrics to select classroom materials (e.g., Fry 2002; Chall 1988). In many 
ways, this practice is well grounded: theories of learning suggest that learning occurs 
most readily when tasks are tailored to students’ ability (e.g., Bjork 1994; Vygotsky 
1978). Vygotsky’s well-known zone of proximal development posits that tasks that are 
challenging, but potentially achievable with adequate support, are more effective for 
learning than tasks that are too easy or too difficult. With this in mind, many researchers 
and publishers have developed ways to estimate text difficulty in order to match 
reading assignments to students’ estimated skill levels (Benjamin 2012). 

Finding texts that are matched to course content, students’ interest, and their current 
reading skill is challenging. Given the abundance of materials available and the varying 
needs of each of their students, instructors simply do not have the time and resources to 
engage in careful evaluation of text difficulty. One approach has been to rely on 
publishers’ anthologies (e.g., basal readers), which define texts according to their 
targeted grade level. Though grade level anthologies offer instructors a quick way to 
find texts that are relevant to the “average” student at a given grade level, the criteria for 
how these texts are selected are often unclear and unsystematic. For example, Scholastic,
a leading publisher of children’s books, offers a variety of systems to identify
grade-level appropriate texts. These systems vary in terms of focus (e.g., interest, skill) 
and grain-size. It is of note that Scholastic acknowledges that not all of their books have 
been levelled using the same systems,¹ leaving it up to the instructors to make these
cross-system comparisons. 

The problem of selecting appropriately-challenging texts is magnified in educational 
technologies that rely on text materials. In order to deliver texts that are well-matched to 
ability, there needs to be a large body of texts for the system to draw upon to meet the 
needs of each student and adapt to those needs as the students’ skills change across 
instruction. Thus, the constraints faced by an individual instructor needing to find 
appropriate texts for a classroom of students are further amplified when scaling up
instruction through automation. As such, developing valid and facile means of 
assessing text difficulty remains an important issue across many areas in education. 

The most common approach has been the use of readability formulas (Bormuth 
1966, 1969) such as Flesch-Kincaid Reading Ease or Grade Level (Flesch 1948; 
Kincaid et al. 1975; Klare 1974), Dale-Chall (Dale and Chall 1948), Gunning Fog
(Gunning 1969), or Lexile (Lennon and Burdick 2004; Stenner et al. 1988). These 
measures are relatively easy to calculate (e.g., number of words per sentence and 
number of syllables per word multiplied by constants) and are even embedded in basic 
word processing software. Indeed, most readability formulas have been driven by ease 
of computation, overlooking key aspects of language related to comprehension and
learning processes (Duran et al. 2007; Graesser et al. 2004; McNamara et al. 2012;
McNamara et al. 2014; McNamara et al. 1996). Texts that contain many short sentences
and words are typically rated as “easy” by readability metrics. However, these metrics
of readability can be poor at predicting comprehension. For example, Begeny and
Greene (2014) assessed passages from the Dynamic Indicators of Basic Early Literacy
Skills (DIBELS) test, a test for assessing the acquisition of early literacy skills, using
eight common readability formulas. They then asked 360 students from different grades
to read the passages. They found that the readability formulas were at or below chance
in identifying the appropriate grade level. One consequence of imprecise, or even
inaccurate, measures of text difficulty is that students may be assigned texts that are
too difficult or too easy (McNamara et al. 1996). Such a mismatch can lead to
sub-optimal learning and, potentially, frustration from both the students and teachers when
students do not perform as well as they “should”.

1 https://www.scholastic.com/teachers/articles/teaching-content/leveled-reading-systems-explained/
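To make concrete how little information a classic readability formula uses, the following sketch computes the Flesch-Kincaid Grade Level from three raw counts, using the constants published by Kincaid et al. (1975). The function name and the example counts are hypothetical and chosen only for illustration; they are not taken from the texts in this study.

```python
def flesch_kincaid_grade_level(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch-Kincaid Grade Level computed from raw counts (Kincaid et al. 1975)."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    # Only two surface ratios enter the formula, each multiplied by a constant.
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical counts for illustration: 100 words, 5 sentences, 150 syllables.
example_grade = flesch_kincaid_grade_level(n_words=100, n_sentences=5, n_syllables=150)
print(round(example_grade, 2))
```

Nothing about word meaning, cohesion, or topic enters the calculation, which is the limitation discussed above.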

One reason for this disconnect between readability and reading comprehension is 
that readability algorithms rely on superficial aspects of language instead of discourse- 
level features (Duran et al. 2007; Graesser et al. 2004). For example, content-driven 
texts may include short sentences composed of monosyllabic, but complex and topic- 
specific words (e.g., “quark”; Si and Callan 2001) or there may be coherence gaps 
between the ideas conveyed across sentences, rendering the text more difficult to 
understand (Graesser et al. 2004). 

One means of improving evaluation of text difficulty is to go beyond simple word- 
based metrics to include linguistic and semantic indices related to discourse compre- 
hension. Advances in NLP have allowed researchers to extract rich information about 
the linguistic features of a text that reflect complex dimensions such as narrativity, 
syntactic complexity, and cohesion (e.g., Crossley et al. 2016a; Crossley et al. 2016b; 
Duran et al. 2007; McNamara et al. 2014). These tools include part-of-speech (POS) 
taggers, parsers, sentiment analyzers, and semantic role labellers. More recently, and of 
particular relevance for the current study, a large number of NLP tools have been 
developed that are driven by cognitive theory to more closely mirror human judge- 
ments of text difficulty. These features include lexical sophistication (Kyle and 
Crossley 2015; Kyle et al. 2018), syntactic complexity (Kyle 2016), and cohesion 
(Crossley et al. 2016a; Crossley et al. 2018; McNamara et al. 2014), as well as clause- 
level features and rhetorical features (McNamara et al. 2013). 

One such NLP tool, Coh-Metrix (McNamara et al. 2014), assesses more than 200 
measures of cohesion, language, and readability.² Coh-Metrix integrates a number of
sophisticated tools (such as advanced syntactic parsers, POS taggers, and distributional 
models) and psycholinguistic databases (Salsbury et al. 2011) to generate indices of 
language, text, and readability (Duran et al. 2007). Coh-Metrix reports standard 
readability metrics (e.g., Flesch-Kincaid) as well as other word-level indices (e.g., 
familiarity, concreteness) drawn from the MRC Psycholinguistics Database 
(Coltheart 1981; Gilhooly and Logie 1980; Paivio et al. 1968; Toglia and Battig 
1978). In addition, Coh-Metrix returns indices that evaluate the degree to which 
information is being connected from sentence-to-sentence, sentence-to-paragraph, and 
paragraph-to-paragraph. Such connections afford a more coherent mental
representation in the mind of the reader. Indeed, indices of cohesion predict ease of
comprehension (Allen et al. 2016; Allen et al. 2015; Graesser et al. 2004; Millis et al.
2007; Ozuru et al. 2005). Thus, one aspect of this work is to demonstrate that the
automated assessment of text difficulty can be enhanced through examination of these
theoretically-motivated features of language.

2 For more on NLP, see McNamara et al. (2018). For a more thorough discussion of Coh-Metrix, see Graesser
et al. (2004) and McNamara et al. (2014).


Machine Learning 


The central purpose of the current study is to examine the effectiveness of different 
types of machine learning algorithms to predict text difficulty, in addition to evaluating 
the effectiveness of using deeper linguistic features such as cohesion. Classic readabil- 
ity formulas rely on General Linear Modeling (GLM), which involves a set of a priori 
statistical assumptions about the nature of the data set that may or may not be 
appropriate depending on the circumstance. In contrast, many machine learning (ML) 
approaches do not make similar statistical assumptions. Recent work has demonstrated 
the utility of machine learning techniques in predicting text difficulty. For example, 
Pitler and Nenkova (2008) compared linear regression and support vector classification 
approaches. Relevant to the current study, they found that the combination of lexical, 
syntactic, and coherence features was more predictive than including only surface level
features. Feng et al. (2010) further demonstrated that classification models performed 
better than regression models. Brunato et al. (2018) used similar approaches but for 
sentence-level text difficulty, and Vajjala and Meurers (2012) applied classification 
models to improve prediction accuracy using insights from second language acquisition 
(SLA). Tanaka-Ishii et al. (2010) used a slightly different method from other re- 
searchers, treating text readability as a ranking problem rather than classification or 
regression. Others (e.g., Collins-Thompson 2014; Francois and Miltsakaki 2012; 
Heilman et al. 2008; Kate et al. 2010; Kotani et al. 2011; Pilan et al. 2014; Pilan 
et al. 2016; Schwarm and Ostendorf 2005; Sung et al. 2015)³ have demonstrated
benefits of incorporating theoretically-motivated linguistic features in the context of 
machine learning and text classification. The present study builds on this body of work 
by considering the possible advantages of hierarchical machine learning. 


Flat Vs. Hierarchical Approaches to Machine Learning 


The previously mentioned machine learning studies focused on the differences between 
regression and methods such as classification or ranking. By contrast, in this study, we 
examine the advantages of using flat versus hierarchical classification approaches for 
text readability. Hierarchical approaches to machine learning involve a series of 
classifications. 

Non-hierarchical, flat classification is the simplest and the most direct approach to
machine learning. It uses either a single classifier or an ensemble classifier, trained on
all of the class labels in the training dataset at once. Imagine, for example, categorizing 100 supermarket
items into 10 classes. In a flat classification, a “rater” would pick up each of the 100
items one at a time and consider the item based on a set of features. The rater would
then use this exploration of features to place the item in one of the 10 categories.
Alternatively, data can be classified using a binary hierarchical classification (see
Fig. 1), which is based on the divide-and-conquer strategy (Casasent and Wang 2005;
Kumar and Ghosh 1999; Wang and Casasent 2009). Using the supermarket example,
the items would first be broken into two macro-classes (e.g., meat and produce). At the
next level, the produce macro-class would be further divided into two smaller
macro-classes of fruit and vegetable, and then the fruit macro-class would be divided into stone
fruit, citrus, or berries, and so on until all items were placed in one of the 10 final classes.

3 We were unable to locate downloadable software or corpora associated with these studies. Thus, we could
not compare our algorithms to those used in these studies. Notably, that was not the purpose of this study nor
does this affect the validity of the previous studies.

In binary hierarchical classification, the classes are divided into two smaller
macro-classes at each node. Only log₂ K classifiers (where K is the number of classes) need to be
traversed in order to move from the top to a bottom decision node. One can use multiple
hierarchical classification as well if there is a large number of categories. Given the
limited number of categories in our targeted corpora (see Corpus), it made more sense to use
a binary classification approach instead of multi-classification.
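The traversal described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: the tree is built by repeatedly halving an ordered list of class indices, and `stub_decider` is a hypothetical stand-in for a trained binary classifier at each node. K = 8 is chosen so that the tree is perfectly balanced.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    lo: int                       # first class index covered by this macro-class
    hi: int                       # one past the last class index covered
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(lo: int, hi: int) -> Node:
    """Recursively split classes lo..hi-1 into two macro-classes per node."""
    node = Node(lo, hi)
    if hi - lo > 1:
        mid = (lo + hi) // 2
        node.left = build_tree(lo, mid)
        node.right = build_tree(mid, hi)
    return node


def stub_decider(node: Node, score: float) -> bool:
    """Hypothetical stand-in for a trained binary classifier: True means 'go left'."""
    return score < node.right.lo


def classify(root: Node, score: float) -> Tuple[int, int]:
    """Return (predicted class, number of binary classifiers consulted)."""
    node, consulted = root, 0
    while node.left is not None:
        node = node.left if stub_decider(node, score) else node.right
        consulted += 1
    return node.lo, consulted


K = 8  # assumed number of final classes, for illustration only
root = build_tree(0, K)
predicted, consulted = classify(root, score=5.3)
print(predicted, consulted, math.log2(K))
```

When K is not a power of two, a tree built by halving in this way needs at most the next whole number above log₂ K decisions for any item.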

Many important real-world classification problems are naturally treated as hierar- 
chical classification problems, where the classes to be predicted are naturally organized 
in a class hierarchy. As a result, when classes are arranged in a hierarchy, a
hierarchical approach can be expected to perform better than
a flat classification. Hierarchical classification has been used in protein classification
(Cerri et al. 2015; Triguero and Vens 2016; Zimek et al. 2008), text classification
(Cesa-Bianchi et al. 2006; Mayne and Perry 2009), essay scoring (McNamara et al.
2015), image annotation (Dimitrovski et al. 2011), and automatic target recognition
(Casasent and Wang 2005). A few studies have demonstrated that hierarchical approaches
outperform flat classifiers (Kumar et al. 2002; Schwenker 2000).


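Fig. 1 Hierarchical Classification Structure. [Tree diagram: the full set of classes (Classes 1, 2, 3, etc.) is split into Macro-class A and Macro-class B; these are split into Macro-classes A1 and A2 and Macro-classes B1 and B2, and so on down to individual classes (Class m, Class n).]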
To our knowledge, hierarchical approaches have not been applied to classification of
text difficulty. Such approaches may be well suited to text difficulty for several reasons. 
First, the features of texts that make them appropriate for middle school readers might 
be different than what makes a text more or less difficult for high school readers — that 
is, there are potential qualitative differences in different categories of texts. Second, 
such approaches may better reflect processes involved in human judgements of text 
difficulty. Imagine an instructor identifying the difficulty of a given text. The instructor 
might first determine if the text is appropriate for young children or adolescents, thus 
determining a grade band before “drilling down” into a specific grade. Text difficulty 
involves classifying items into a number of classes that naturally form a hierarchy, 
rather than simple dichotomous identification. As such, a hierarchical approach may be 
particularly apt for this complex task. 


The Current Study 


The current study leverages advances in NLP and ML to examine how the two in 
combination might predict human ratings of text difficulty. It extends the
research into hierarchical ML approaches conducted by Balyan et al. (2018), which
demonstrated the potential utility of a hierarchical approach.
In the current work, we systematically investigate how features derived
from natural language processing can improve upon simple readability metrics and how 
the combination of NLP and ML approaches can support more authentic and accurate 
estimations of text difficulty. Notably, the aim of this work was not to produce new 
readability formulas, but to demonstrate new ways of estimating grade level that are 
considerate of complex aspects of language implicated by theories of discourse com- 
prehension. The work examines the accuracy of ML classification approaches in 
predicting human ratings of text difficulty. Importantly, we used human comparison 
ratings of text difficulty rather than pre-labelled data (e.g., texts with grade difficulty 
assigned from unknown origins). 

In previous studies, the corpora were either randomly selected from large corpora or 
were not publicly available. Thus, rather than attempting to replicate results with 
imperfect comparisons, we selected a new corpus that reflected the type of text set 
that educators might encounter in their practice. Our selected corpora were two sets of 
expository texts appropriate for a wide range of readers (elementary through college). 
These text sets were pre-existing and afforded us the opportunity to test our premises on 
authentic data sets. 

We used NLP tools to identify critical linguistic indices. These indices were then
used as predictors in models trained using both flat and
hierarchical ML classification approaches. Our objective was to examine ML classification
accuracy as compared to and in combination with classic readability metrics. We
compared a variety of classification algorithms as there is no universally best learning 
algorithm that fits all, and the “best” models differ depending on the classification task 
at hand (Caruana and Niculescu-Mizil 2006). For example, SVMs are designed to
perform better on high-dimensional data, but they are considered complex and take a long
time to train. Tree-based methods are not influenced by outliers or
multicollinearity and do not make any assumptions about the distribution of the data, but
these methods do not have very high predictive power. The performance of an
algorithm may depend on whether the problem is linearly or non-linearly separable,
the type of kernel used, or the hyper-parameters used while training the model.
Therefore, rather than selecting a single classifier, we explored several machine 
learning classification algorithms in Weka (version 3.8.1). 

It was hypothesized that the NLP indices motivated by cognitive theories of 
discourse comprehension would be better predictors of text difficulty compared to 
classic readability metrics (e.g., Flesch-Kincaid Grade Level). It was also predicted that 
the combination of NLP and ML and, more specifically, a hierarchical ML approach 
would yield classifications more similar to the human ratings than a flat or non- 
hierarchical classification. Flat classification approaches make a single decision involving
all the categories in the data. It is more difficult to make a single decision among multiple
categories that may potentially be unbalanced (Babbar et al. 2013) than to make a series of
step-by-step dichotomous decisions, which can be more accurate.


General Methods 
Corpus 


We conducted our experiments on existing real-world text sets. The corpus included 
the two text sets within Interactive Strategy Training for Active Reading and 
Thinking (iSTART), an intelligent tutoring system (ITS) that supports successful
reading comprehension of complex informational texts through self-explanation 
training (McNamara et al. 2004; Snow et al. 2016). The text sets were collected 
from a variety of open-source resources in the context of other iSTART develop- 
ment projects. These text sets were ideal for this set of experiments as they were 
designed to have a variety of levels of difficulty (see Perret et al. 2017; Snow et al. 
2016). Set A included texts that varied widely in genre and difficulty, whereas Set B 
comprised texts used to provide information typical in high school and college 
science courses. One limitation to the corpus is that it is relatively small. Another 
limitation of note is that the corpus contains an unbalanced number of instances in 
each class. That is, we did not actively select the same number of texts at each level 
of difficulty. While class imbalance can affect the accuracy of ML models, the 
corpus reflects an authentic challenge. Specifically, the library of texts was selected 
because it is embedded within a real-world learning environment, and not for the 
purposes of developing a classification model. 

The first set, Set A, comprised texts from the iSTART StairStepper module
(n = 162), including expository, informational texts about science, history, pop culture,
and sports. These texts were collected from publicly available websites (see Perret et al. 
2017) and contained, on average, 389 words and 27 sentences. 

The second set of texts, Set B (n = 100), comprised the texts from iSTART’s
main text library. These texts are complex, informational texts about scientific phenomena
compiled in the original development of iSTART (Jackson and McNamara
2013). The texts were culled from various sources, primarily science textbooks. The
texts in Set B contained, on average, 380 words and 25 sentences. The number of texts on each
topic is shown in Table 1.



344 International Journal of Artificial Intelligence in Education (2020) 30:337-370 


Table 1 Frequency of Texts by Topic 


Text / Topic    Science    Social Sciences    Sports    Pop Culture    Total
Set A           47         75                 19        13             162
Set B           100        0                  0         0              100


Human Ratings 


Some of the texts in the corpus were pre-labelled with a grade level difficulty by the 
original sources. However, we found no relations between these pre-labelled levels 
and common readability measures. In addition, we found several examples of 
passages that were inconsistently labelled across sources. In order to establish 
accurate benchmarks, we employed human comparison ratings to evaluate the 
difficulty of each text. 

The texts in the two sets were rated separately, following the
same procedure (see Johnson et al. 2017). We developed a qualitative approach
drawn from methods of discourse analysis (e.g., Gee 2004; van Dijk 1985) that
assesses the text beyond word- and sentence-level information to consider the text as a
whole. This approach was similar to an unsupervised clustering task, in which four 
raters (members of the research lab) iteratively sorted the texts as a team. As a 
group, the four raters first did a “rough sort” of the entire set of texts into three broad
sets (easy, medium, and difficult). In this first sort, we were not doing direct 
comparisons of each text, but rather an approximation of text difficulty. The group 
of raters then read each text more carefully and separated each of the three levels 
into “easier” and “more difficult”, yielding six levels. This process continued until 
there was unanimous agreement amongst the raters that the set of texts could not be 
separated any further. Adjacent levels were combined and resorted by each rater 
until there was agreement that 1) each level was distinguishably different from the 
adjacent level and that 2) every text in a given level was of comparable difficulty. 
Disagreements were resolved through discussion. Thus, agreement was reached 
across all four raters for all difficulty levels. This process resulted in 12 levels for 
Set A and 9 levels for Set B. Because Set B was designed to include more difficult 
texts than Set A, the levels across the text sets needed to be aligned. Two of the 
original raters determined the comparable difficulty across the two sets. Raters read 
the easiest texts from Set B and compared this group of texts to the levels in Set A. 
It was agreed that the easiest texts in Set B were equivalent to the sixth level of 
difficulty in Set A. Further reading and discussion confirmed that each increasing 
level of difficulty in Set A matched those in Set B. Consequently, Set A texts 
ranged in difficulty from level 1 to level 12 and Set B texts ranged in difficulty from
level 6 to 14. The combined Set A and Set B corpus resulted in 262 texts 
categorized into 14 difficulty levels (1-14). 

Correlational analyses indicated that these human ratings of text difficulty were 
strongly related to Flesch-Kincaid Grade Level (FKGL) for Set A (r = 0.79). In Set B, where the
text difficulty range was higher, this correlation was less strong (r = 0.41). These
correlations indicate that the expert ratings were consistent with FKGL, but not
redundant. This suggests that our human judgements of difficulty depended on differ- 
ent, or at least additional, features of the text not considered in simple readability 
formulas, particularly for more complex texts. 

In exploratory experiments, we applied different ML algorithms to readability
measures (such as Flesch-Kincaid Grade Level and Flesch-Kincaid Reading Ease) and
several other linguistic features, treating every text difficulty level as a separate class (1-12 for Set A and
6-14 for Set B). The classification accuracy for Set A for the ML algorithms was quite
low, ranging from 13.33% to 31.67% for readability formulas and 25.97% to 33.95% 
when additional linguistic features were used. The accuracy range for Set B was 
between 8.11% and 18.92% for readability formulas, which improved slightly when 
additional linguistic features were used, ranging between 29.17% and 35.42%. The 
classification accuracy further decreased when we combined the two data sets with 
accuracy ranging from 19.44% to 26.39%. Consequently, we clustered the fine-grained 
text difficulty levels (1-14) into more coarse-grained levels. The researchers re-read the 
texts in each difficulty level to identify intuitive breaks in the text set. This resulted in 
four difficulty levels: low (1-4), middle (5-8), high (9-12), and very high (13 and 14).
Set A included low, middle, and high difficulty texts, while Set B included middle,
high, and very high difficulty texts. The distribution of texts and the partitions made across the
difficulty levels are shown in Fig. 2.
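The regrouping of the 14 fine-grained levels into four coarse bands amounts to a simple lookup. The sketch below encodes the level ranges stated above; the names `BANDS` and `difficulty_band` are hypothetical and introduced only for illustration.

```python
# Level ranges as stated in the text: low (1-4), middle (5-8),
# high (9-12), and very high (13 and 14).
BANDS = {
    "low": range(1, 5),
    "middle": range(5, 9),
    "high": range(9, 13),
    "very high": range(13, 15),
}


def difficulty_band(level: int) -> str:
    """Map a fine-grained difficulty level (1-14) to its coarse band."""
    for name, levels in BANDS.items():
        if level in levels:
            return name
    raise ValueError(f"level must be between 1 and 14, got {level}")


# Set A spans levels 1-12 and Set B spans levels 6-14.
set_a_bands = sorted({difficulty_band(level) for level in range(1, 13)})
set_b_bands = sorted({difficulty_band(level) for level in range(6, 15)})
print(set_a_bands, set_b_bands)
```

Applied to the level ranges of the two sets, the mapping reproduces the band coverage reported above: Set A covers low, middle, and high, while Set B covers middle, high, and very high.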


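Fig. 2 Text difficulty levels across the two sets of texts. [Chart: the fine-grained text difficulty levels grouped into four bands, Low (Elementary), Middle, High, and Very High (College), with the number of texts per band for Set A and Set B. The counts that survive for Set B are 25, 52, and 23, with no texts in the Low band.]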
Feature Extraction and Selection


Following the corpus collection and human rating of difficulty levels of those texts, the 
next step included extraction of linguistic features/indices of the two text sets. We 
employed Coh-Metrix (McNamara et al. 2014) to extract the linguistic features of texts. 
Coh-Metrix returns over 200 linguistic features, some of which are not theoretically- 
relevant for the informational texts in this corpus. The features extracted from
Coh-Metrix vary significantly in their scales, and therefore the data were normalized
(i.e., the values were rescaled to the range [0, 1] using the Min-Max method) before any
machine learning algorithms were used. Further, to avoid overfitting, we used feature
selection methods to reduce the dimensionality of features (i.e., the number of indices). 
We removed features having zero variance (ZV) or nearly zero variance (NZV) and 
applied a recursive feature elimination (RFE) approach. We tried multiple classifiers for 
the RFE including random forest, Naive Bayes and bagged tree to determine the 
common features selected by the classifiers. Because RFE can be negatively affected 
by multicollinearity (Lieberman and Morris 2014), we also removed highly correlated 
(Pearson’s r>.85) features before applying RFE. 
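The preprocessing steps that precede RFE can be sketched as follows. This is a simplified illustration under stated assumptions rather than the pipeline used in the study: the feature names and values are invented, the correlation threshold is applied to the absolute value of Pearson's r, the first feature of a correlated pair is the one retained, and the near-zero-variance check and RFE itself are omitted.

```python
import math
from typing import Dict, List


def min_max(values: List[float]) -> List[float]:
    """Rescale a feature column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # zero-variance column carries no information
    return [(v - lo) / (hi - lo) for v in values]


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation between two equally long, non-constant columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def filter_features(table: Dict[str, List[float]], r_max: float = 0.85) -> Dict[str, List[float]]:
    """Drop zero-variance columns, normalize, and drop highly correlated columns."""
    kept: Dict[str, List[float]] = {}
    for name, column in table.items():
        if max(column) == min(column):
            continue  # zero variance
        scaled = min_max(column)
        # Assumption: keep the earlier feature when a pair exceeds the threshold.
        if all(abs(pearson_r(scaled, other)) <= r_max for other in kept.values()):
            kept[name] = scaled
    return kept


# Hypothetical feature table (four texts); not actual Coh-Metrix indices.
table = {
    "word_length": [4.1, 4.5, 5.0, 5.6],
    "syllables_per_word": [1.3, 1.4, 1.6, 1.8],  # nearly collinear with word_length
    "constant": [1.0, 1.0, 1.0, 1.0],            # zero variance
    "sentence_length": [20.0, 12.0, 25.0, 15.0],
}
kept = filter_features(table)
print(list(kept))
```

In this toy table the constant column and the nearly collinear column are removed, leaving two normalized features that would then be passed on to a step such as RFE.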

Applying these feature selection approaches resulted in a set of eight linguistic 
indices (see Table 2). Notably, these features have been shown to correlate with text 
difficulty in prior work (Crossley et al. 2012; Crossley and McNamara 2009; Klare 
1984; McCarthy and Jarvis 2010). These indices relate to lexical sophistication, 
readability, lexical diversity, and syntactic complexity, and are briefly described in 
Table 2. Age of acquisition was found to be the most important feature, whereas familiarity was
the least important, for all the data sets when classifying texts.

A series of analysis of variance (ANOVA) tests confirmed that each of the eight 
linguistic features differed as a function of the human ratings of text difficulty, both for 
the two sets of data individually and for the combined dataset (Table 3). 

The omnibus ANOVAs indicated a significant effect of text difficulty on the 
majority of indices (indicated in bold). The exceptions are lexical diversity in Set B, 
and familiarity in Set B and the combined set. Post hoc tests revealed that the 
significant effects were driven by differences between the ‘low’ level compared to 
the ‘high’ and ‘very high’ levels. Notably, few indices showed differences between the 
‘middle’ and ‘high’ levels. In sum, these tests confirm that these linguistic features are 
appropriate for determining text difficulty. 


Machine Learning Algorithms 

In this study, we compared some of the most commonly used classification algorithms. 
A summary of these algorithms is given in Table 4 (see also Appendix B and Balyan 
et al. 2017). While determining the accuracy of models, the model parameters were 
tuned wherever applicable using the Grid Search approach. 


Experiments and Results 


After the confirmatory ANOVAs (Table 3), we used the NLP features in a series of 
classification experiments to predict human ratings of text difficulty. As our data 
involved categorical human ratings, we adopted supervised machine learning
classification approaches rather than regression. We conducted the following experiments:


Experiment 1 (FKGL) 


Flat (non-hierarchical) classification comparing a ZeroR baseline classifier to more than 
40 different classifiers, using FKGL as the sole predictor. (For clarity, we report only 
the results of eight classifiers that achieved the highest accuracy). 


Experiment 2 (FKGL+) 


Flat (non-hierarchical) classification comparing ZeroR to the other classifiers using 
FKGL in conjunction with additional linguistic features derived from the feature 
selection (Table 2) to investigate the benefits of including the additional linguistic 
features. 


Experiment 3 


Hierarchical classification using the most accurate single classifiers (not ensembles)
obtained from Experiments 1 and 2 to examine the potential advantages of a hierarchical
approach.

The classifiers used for the experiments were implemented using Weka version
3.8.1 and R packages. We used 10-fold stratified cross validation to compute the
performance metrics of the classification models. Using only a single split into training and
test data (i.e., a holdout method) can lead to high variance and biased results, because the results
may depend heavily on the data points included in the training and test sets. Therefore,
we used multiple (10) splits of the data into training and test sets, hence 10-fold cross
validation. With this method, the results depend less on how the data are divided, because every data
point serves as a test data point once and as a training data point 9 times, thus reducing the
variance as the number of folds increases. The classification accuracy in this study is the
proportion of true results (both true positives and true negatives) among the total 
number of examined cases. We also report accuracy and the F-scores for clarity. 
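The stratified 10-fold procedure described above can be sketched as follows. This is a minimal, standard-library illustration and not the Weka or R procedure used in the study; the function names, the nearest-class-mean toy classifier, and the FKGL-like scores are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split item indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class's items round-robin, continuing where the last class stopped.
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)
    return folds


def cross_validated_accuracy(xs, ys, fit, predict, k=10, seed=0):
    """Pooled accuracy: every item is a test item once and a training item k-1 times."""
    correct = 0
    for test_idx in stratified_folds(ys, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(ys)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct += sum(predict(model, xs[i]) == ys[i] for i in test_idx)
    return correct / len(ys)


# Hypothetical toy classifier: assign the class whose training mean is nearest.
def fit_nearest_mean(xs, ys):
    grouped = defaultdict(list)
    for x, y in zip(xs, ys):
        grouped[y].append(x)
    return {y: sum(vals) / len(vals) for y, vals in grouped.items()}


def predict_nearest_mean(model, x):
    return min(model, key=lambda y: abs(model[y] - x))


# Hypothetical, well-separated FKGL-like scores: 10 texts per difficulty class.
scores = [base + 0.1 * i for base in (3.0, 7.0, 11.0) for i in range(10)]
levels = ["elementary"] * 10 + ["middle"] * 10 + ["high"] * 10
cv_accuracy = cross_validated_accuracy(
    scores, levels, fit_nearest_mean, predict_nearest_mean
)
```

Because the folds are stratified, each of the 10 folds in this toy example holds one text from each class, so every training set keeps the class balance of the full data.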


Experiment 1: Non-hierarchical Classification Using FKGL The first set of experiments
considered only FKGL to predict text difficulty. Table 5 shows the best classifiers of
the 40 that were tested. Note that these classifiers are consistent with those shown to be
the most accurate in a number of other text classification applications (Aggarwal and
Zhai 2012; Hartmann et al. 2019; Kowsari et al. 2019; Sun and Lim 2001). The
baseline classifier (ZeroR) classification accuracy for Set A was lower (0.38) than that
of Set B (0.52) and the combined set (0.42). The accuracy of the different classifiers
varied across data sets (Set A: 0.66-0.71; Set B: 0.45-0.60; Combined: 0.45-0.56), and
the best-performing classifiers in each data set improved on the baseline classifier (ZeroR).

For Set A, several algorithms including Naive Bayes, linear discriminant analysis 
(LDA) and AdaBoost achieved the highest classification accuracy (0.71) and highest 
kappa (0.56). 
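To make the baseline and the kappa statistic concrete, the sketch below implements a ZeroR-style majority-class predictor and Cohen's kappa using only the standard library. The label counts are hypothetical, chosen so that the majority class makes up 38% of the items; the predictor is scored on the same labels it was derived from, which is a simplification of the cross-validated setup used in the study.

```python
from collections import Counter


def zero_r(train_labels):
    """ZeroR ignores all predictors and returns the majority class."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def cohen_kappa(actual, predicted):
    """Agreement between predictions and labels, corrected for chance agreement."""
    n = len(actual)
    observed = accuracy(actual, predicted)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(
        actual_counts[c] * predicted_counts[c] for c in actual_counts
    ) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Hypothetical label distribution (not the study's data).
labels = ["easy"] * 38 + ["medium"] * 31 + ["difficult"] * 31
majority = zero_r(labels)
predictions = [majority] * len(labels)
zero_r_accuracy = accuracy(labels, predictions)
zero_r_kappa = cohen_kappa(labels, predictions)
```

A classifier that always predicts the majority class agrees with the labels exactly as often as chance would predict, so its kappa is zero even though its accuracy equals the majority-class share.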

For Set B, the classification accuracy varied from 0.45 to 0.60. Notably, the highest 
accuracy obtained for the classifiers in Set B was less than the lowest classification 
Table 2 Description of the Linguistic Features

Linguistic Feature: Description

Flesch-Kincaid Grade Level (FKGL): FKGL is a simple measure of readability computed
using the average number of syllables per word and the average number of words per
sentence. The number of words in a sentence correlates with the effort required to read
the sentence, while the number of syllables in a word is inversely related to word
frequency and affects reading difficulty (Zipf 1949; Dufty et al. 2006). Lower grade
levels correspond to easier texts.

L2 readability score: The L2 readability score predicts the readability of texts for
second-language learners (Crossley et al. 2012). It considers content word overlap,
sentence syntactic similarity, and word frequency. In contrast to FKGL, higher scores
indicate easier texts.

Syntactic complexity: The syntactic complexity of a sentence is determined by
considering the mean number of words before the main verb and the number of
higher-level constituents per word in the sentence. Sentences with less syntactic
complexity are easier to process and comprehend (Crossley et al. 2012; Perfetti et al.
2005).

Uncommon or rare words: This index reflects how rarely the words in a text occur in the
English language. More uncommon or rare words make a text more difficult: text
difficulty is expected to increase if there are words that readers have never or rarely
encountered. This index is computed from CELEX (Baayen et al. 1995), a
17.9-million-word corpus.

Lexical diversity: Lexical diversity refers to the variety of words used in a text. It
is usually measured using type-token ratios (TTR), which are related to text length. In
order to consider indices regardless of text length, we consider MTLD (measure of
textual lexical diversity; McCarthy 2005) and D values (Malvern et al. 2004; McNamara
et al. 2013) computed by our NLP tool. The MTLD index is calculated as the mean length
of sequential word strings that maintain a criterion level of lexical variation or a
given TTR value (McCarthy 2005).

Word familiarity: Word familiarity refers to how familiar a word is to an adult, or how
easily an adult recognizes it. For example, the words ‘cat’, ‘dog’, ‘table’, and ‘fan’
have a higher average familiarity than the words ‘cortex’, ‘dogma’, and ‘wigwam’. Word
familiarity ratings are computed using the MRC Psycholinguistic Database, which
provides ratings for several thousand words along several psychological dimensions.
Sentences that contain more familiar words are processed more quickly (McNamara et al.
2013).

Word imageability: Word imageability refers to the ease with which one can construct a
mental image of a word in one’s mind. High-imagery words include terms such as
‘airplane’ or ‘hammer’, whereas words like ‘dogma’ or ‘quantum’ are much less imageable
(Paivio et al. 1968).

Age of Acquisition: Age of Acquisition refers to the age at which a word first appears
in a child’s vocabulary (Paivio et al. 1968).


accuracy for Set A. Additionally, most of the classifiers failed to predict any instance of
the ‘Medium’ class, except Random Forest and the ensemble classifiers (such as Bagging
and Boosting) that used Random Forest as the base classifier.

The baseline classification accuracy for the combined dataset was 0.42. The
classification accuracy for the rest of the classifiers varied between 0.45 and 0.56. The
two classifiers with the highest accuracy (SVM, Naive Bayes) did not predict any instance
of the ‘Very Difficult’ class. Other well-known classifiers (e.g., BayesNet, neural
Table 4 Brief description of classification algorithms used in the study

Algorithm: Description

ZeroR: ZeroR was used as the baseline classifier for our experiments. It uses the
simplest classification approach, which relies only on the target and ignores all
predictors. It predicts the majority class in the data. ZeroR has no predictive power;
thus, it is useful for determining a baseline performance as a benchmark for other
classification methods (Witten et al. 1999).

Naive Bayes: Naive Bayes is based on Bayes’ theorem of posterior probability. It is a
probabilistic learning method, which assumes that the effect of an attribute value on a
given class is independent of the values of the other attributes (McCallum and Nigam
1998).

Logistic Regression: Logistic regression is a statistical model that (in its basic form)
uses a logistic function to model a binary dependent variable, although many more
complex extensions exist. In regression analysis, logistic regression (or logit
regression) estimates the parameters of a logistic model, a form of binary regression
(George-Nektarios 2013).

Support Vector Machines (SVM): SVM constructs a hyperplane that separates the data into
classes. SVMs are efficient for high-dimensional feature spaces and are among the best
supervised learning algorithms (Dumais et al. 1998; Joachims 1998).

Linear Discriminant Analysis (LDA): LDA, also known as normal discriminant analysis
(NDA) or discriminant function analysis, is a generalization of Fisher’s linear
discriminant, a method used in statistics, pattern recognition, and machine learning to
find a linear combination of features that characterizes or separates two or more
classes of objects or events (Martinez and Kak 2001; Mika et al. 1999).

MultiClass: MultiClass is a metaclassifier for handling multi-class datasets with
2-class classifiers. This classifier is also capable of applying error-correcting output
codes for increased accuracy (Rojas 1996; Zhang 2000).

Boosting: Boosting is a meta-algorithm that incrementally builds an ensemble by
iteratively training weak learners or classifiers. While training new models, it
emphasizes instances that were misclassified by the previous models. Thus, each model
is trained on data weighted according to the performance of the previous model. The
final result is the weighted sum of the results of all of the classifiers. LogitBoost is
used for performing additive logistic regression, and AdaBoost boosts a nominal class
classifier using the AdaBoost M1 algorithm (Freund and Schapire 1996; Krogh and
Vedelsby 1994).

Bagging: Bagging is an ensemble classifier that uses bootstrap aggregation (or
“bagging”) to reduce variance. This implementation works for both classification and
regression, depending on the base learner. In the case of classification, predictions
are generated by averaging probability estimates, not by voting (Breiman 1996; Ho 1995;
Schapire and Singer 1999; Schölkopf and Smola 2002).



network) also failed to predict any instance of the ‘Very Difficult’ class. The ensemble
classifiers that did predict instances of the ‘Very Difficult’ class (e.g., Bagging and
Boosting) had low precision (0.19-0.62), recall (0.09-0.22), and F-scores (0.15-0.32).
The F-score (F1) is computed as the harmonic mean of precision and recall.
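As a worked check of the harmonic-mean definition, the short sketch below recomputes two F-scores from precision and recall pairs reported in Table 10. The function names are illustrative, and returning 0.0 when a class is never predicted is an assumed convention, chosen because the appendix tables list 0.00 in that situation.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(actual, predicted, target):
    """Per-class precision and recall for one target class."""
    true_pos = sum(a == target and p == target for a, p in zip(actual, predicted))
    predicted_pos = sum(p == target for p in predicted)
    actual_pos = sum(a == target for a in actual)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Precision/recall pairs taken from Table 10: the 'Difficult' class under ZeroR
# and the 'Very Difficult' class under LogitBoost (REPTree).
zero_r_difficult = round(f_score(0.42, 1.00), 2)
logitboost_very_difficult = round(f_score(0.62, 0.22), 2)
```

Both values reproduce the F-measures listed in Table 10 (0.59 and 0.32). The second case shows how the harmonic mean is pulled toward the weaker of the two quantities, here the low recall.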
Experiment 2: Non-hierarchical Classification Using FKGL plus Linguistic Features The
second set of experiments was conducted using the same set of classifiers used in
Experiment 1 to examine the extent to which classification accuracy was improved
when linguistic features (see Table 2) were added as predictors. We refer to this set of
linguistic features as FKGL+.

The classification accuracy results for Set A improved from 0.61-0.71 to 0.75-0.82
when additional linguistic features were used as predictors. Some classifiers, such as
SVM, LDA, and logistic, showed an improvement of over 14% in classification
accuracy and F-scores. As with Set A, some classifiers for Set B (AdaBoost and
MultiClass) showed an improvement in accuracy of over 33%. However, the
classification accuracy for the Set B classifiers was still low compared to the
classification accuracies and F-scores for Set A. The SVM classifier obtained one of
the highest accuracies for this set, but it did not predict any instance of the
‘Middle’ class.

Like the individual data sets, the combined set also showed improved classification
accuracy, with some classifiers (mostly the ensembles) improving by over 20%. For
Bagging, the improvement in classification accuracy and F-scores was approximately
51%. The SVM classifier achieved the third highest classification accuracy and was
able to identify instances of all four classes. The summary results for some of the
top-performing classifiers are shown in Table 5. Because this study involves
multi-class classification, with 3 classes in Set A and Set B and 4 classes in the
combined set (Set A + Set B), every classifier returns a separate F-score for each
class in each data set. To avoid rendering the table overly complex, Table 5 does not
report the individual F-score for each class but instead specifies the range of
F-scores returned by every classifier for each data set. Precision, recall, and
F-scores for all of the classes for the ML classifiers are provided in Appendix A.

Consistent with Experiment 1, classification accuracies and F-scores were higher
for Set A as compared to those for Set B and the combined data set. Importantly,
Experiment 2 demonstrated that the inclusion of the theoretically motivated
linguistic features improved the accuracy of the classifiers for both sets of texts
individually as well as the combined set. For clarity, the improvement in
classification accuracy when using FKGL+ over FKGL is illustrated in Fig. 3. The details
behind the summarized results in Table 5, including the precision and recall for the
classifiers and the rationale for the reported ranges of accuracy and F-score values,
are provided in Appendix A.


Experiment 3: Hierarchical Classification Experiment 3 examines the potential ad- 
vantages of using a hierarchical, rather than flat classification approach. It was 
predicted that this approach would be more accurate because there were more 
than two classes that were ordinally ranked. From a more theoretical perspective, 
easier texts might be distinct from one another based on more superficial aspects 
of texts (e.g., word difficulty) whereas more difficult texts might vary along 
more complex dimensions of language (e.g., syntax). This multidimensionality of 
language might be better captured in the multiple runs to separate the different 
classes. 
Table 5 Accuracy and F-Scores for classifiers using FKGL and FKGL+ as predictor variable

Classifier        Features  Set A               Set B               Set A + Set B
                            Accuracy (F-score)  Accuracy (F-score)  Accuracy (F-score)
ZeroR (Baseline)            0.38                0.52                0.42
Naive Bayes       FKGL      0.71 (0.62-0.74)    0.59 (0.00-0.70)    0.56 (0.00-0.69)
                  FKGL+     0.77 (0.72-0.86)    0.57 (0.35-0.68)    0.58 (0.47-0.66)
Logistic          FKGL      0.70 (0.61-0.75)    0.60 (0.00-0.62)    0.56 (0.15-0.69)
                  FKGL+     0.82 (0.76-0.87)    0.63 (0.48-0.68)    0.62 (0.55-0.72)
SVM               FKGL      0.68 (0.59-0.74)    0.57 (0.00-0.69)    0.56 (0.00-0.68)
                  FKGL+     0.80 (0.76-0.88)    0.63 (0.00-0.73)    0.64 (0.39-0.77)
LDA               FKGL      0.71 (0.62-0.77)    0.59 (0.00-0.70)    0.55 (0.07-0.69)
                  FKGL+     0.81 (0.76-0.89)    0.64 (0.50-0.71)    0.61 (0.54-0.68)
MultiClass        FKGL      0.70 (0.56-0.77)    0.45 (0.31-0.54)    0.47 (0.17-0.51)
                  FKGL+     0.79 (0.72-0.88)    0.63 (0.50-0.70)    0.60 (0.49-0.69)
LogitBoost        FKGL      0.66 (0.56-0.76)    0.56 (0.26-0.66)    0.54 (0.32-0.64)
                  FKGL+     0.75 (0.67-0.86)    0.63 (0.43-0.68)    0.64 (0.37-0.73)
AdaBoost          FKGL      0.71 (0.62-0.77)    0.45 (0.31-0.54)    0.50 (0.17-0.63)
                  FKGL+     0.76 (0.70-0.89)    0.60 (0.44-0.66)    0.64 (0.41-0.75)
Bagging           FKGL      0.70 (0.62-0.76)    0.53 (0.17-0.64)    0.45 (0.19-0.53)
                  FKGL+     0.77 (0.73-0.86)    0.66 (0.43-0.76)    0.68 (0.44-0.77)


Implementing ensemble classifiers is complex because, for each ensemble,
the output of one model is given as input to another model. This process needs
to be implemented for each level of the hierarchical model, making the training
process cumbersome and time consuming. For simplicity and ease of
implementation in the hierarchical classification, we considered the classification
accuracy and F-scores of only the single classifiers in Table 5 (Naive Bayes,
logistic, SVM, and LDA).

We conducted three experiments using hierarchical classification for each of 
the data sets by using combinations of different class/category texts. For 
example, Set A data were classified into three classes: elementary, middle, 
and high. For the first run, we first classified texts as ‘elementary’ and ‘other’.
At the second level in this experiment, the ‘other’ class was classified into
‘middle’ and ‘high’. For the second run, the texts were first classified as
‘middle’ and ‘other’ and then the ‘other’ texts were further classified as 
‘elementary’ and ‘high’. Finally, for the third run, the texts were first classified 
as ‘high’ and ‘other’ and then the ‘other’ texts were reclassified as ‘elementary’ 
and ‘middle’. A summary of these experimental combinations for different data 
sets is provided in Table 6. 
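The run structure described above can be enumerated programmatically. The sketch below is illustrative only (the function names are hypothetical): for a three-class set, each run isolates one class at the first level and splits the remaining two at the second; for the four-class combined set, each run separates two pairs of classes at the first level, matching the partitions listed for the combined set in Table 7.

```python
def one_vs_other_runs(classes):
    """For each class, level 1 separates it from 'other'; level 2 splits the rest."""
    return [(first, [c for c in classes if c != first]) for first in classes]


def paired_runs(classes):
    """For four classes, level 1 separates two pairs; level 2 splits each pair."""
    first, *rest = classes
    runs = []
    for partner in rest:
        pair_a = (first, partner)
        pair_b = tuple(c for c in rest if c != partner)
        runs.append((pair_a, pair_b))
    return runs


def relabel_level1(labels, first):
    """Collapse every class except `first` into a single 'other' class."""
    return [label if label == first else "other" for label in labels]


set_a_runs = one_vs_other_runs(["elementary", "middle", "high"])
combined_runs = paired_runs(["elementary", "middle", "high", "college"])
```

For Set A this yields the three runs described in the text: elementary vs. (middle, high), middle vs. (elementary, high), and high vs. (elementary, middle).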

The classifier accuracy at each level is determined by comparing the classifier
output with the actual class at that level. This accuracy was used to select the
best classifier for a particular level in an experiment or run. The final accuracy
(Table 7) is computed by applying the best classifier at each hierarchy level. It
is computed separately rather than derived from the accuracies of the individual
levels: once the whole test set has been divided into the relevant classes, the
predictions for each class are compared to the human ratings.
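The distinction between level-wise accuracy and final accuracy can be illustrated with a small two-level cascade. In this sketch the threshold rules are hypothetical stand-ins for the trained classifiers at each level (the study used SVM and LDA), and the scores and ratings are invented for illustration.

```python
def hierarchical_predict(x, is_first_class, split_rest, first_class):
    """Level 1 decides first_class vs. 'other'; level 2 labels the 'other' items."""
    return first_class if is_first_class(x) else split_rest(x)


def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical stand-ins for the trained level-1 and level-2 classifiers.
def is_elementary(score):
    return score < 5.5


def middle_or_high(score):
    return "middle" if score < 9.5 else "high"


# Hypothetical test items (an FKGL-like score and a human rating for each text).
scores = [3, 4, 6, 7, 8, 10, 11, 12]
ratings = ["elementary", "elementary", "elementary", "middle",
           "middle", "middle", "high", "high"]

# Level 1: 'elementary' vs. 'other', scored against the collapsed human ratings.
level1_accuracy = accuracy(
    ["elementary" if r == "elementary" else "other" for r in ratings],
    ["elementary" if is_elementary(s) else "other" for s in scores],
)
# Level 2: 'middle' vs. 'high', scored on the texts that are truly not elementary.
rest = [(s, r) for s, r in zip(scores, ratings) if r != "elementary"]
level2_accuracy = accuracy([r for _, r in rest], [middle_or_high(s) for s, _ in rest])
# Final: run the full cascade on every text and compare with the human ratings.
final_accuracy = accuracy(
    ratings,
    [hierarchical_predict(s, is_elementary, middle_or_high, "elementary") for s in scores],
)
```

In this toy case the level accuracies are 0.875 and 0.80, while the final accuracy is 0.75: a text misrouted at the first level cannot be recovered at the second, which is why the final figure has to be computed on the cascade's end-to-end predictions.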

We observed that, of the four single classifiers discussed above, SVM
and LDA performed best for the binary classification used in the hierarchical
approach. The classification accuracy of all three experiments for the
hierarchical approach is summarized in Table 7, which shows the classification
results obtained for the final model. Hierarchical classification improved the
accuracy of the model for both Set B (from 0.64 to 0.72) and the combined set
(A+B; from 0.61 to 0.65) when compared with the accuracy for these sets under
the non-hierarchical approach. In contrast, the accuracy decreased slightly for
Set A (from 0.81 to 0.79). The SVM/LDA values in the “classifier” column in
Table 7 indicate that both classifiers (LDA and SVM) performed equally; hence,
it did not matter which of them was chosen at that level.

The classification results for each level and the final accuracy in the three experiments
appear in Table 7. We considered all four single classifiers at each level, but report the
results of only the best-performing model. The highest accuracy for each data set in
each experiment is indicated in bold. The confusion matrix and the accuracy for each
text difficulty level are provided in Appendix D.

The classification results (Table 7) indicate that the model classification
accuracy depends on the data combinations considered at a specific level of the
hierarchy. Thus, it is important to carefully choose the class combinations for a
particular hierarchy level for appropriate classification of the data. The first
experiment (or run) resulted in the highest accuracy for Set A, the first and
third experiments resulted in the highest accuracy for Set B, and the second
experiment resulted in the highest accuracy for the combined (A+B) data.

In sum, the linguistic features of the text differed across the human ratings of
text difficulty. These features were particularly salient between the ‘elementary’
and ‘high’ levels for Set A, and ‘middle’ and ‘college’ levels for Set B.
Consistent with extant work on text difficulty, the results show that lower-level
texts (‘elementary’ for Set A and ‘middle’ for Set B) are generally less lexically
and syntactically sophisticated than higher-level texts (‘high’ for Set A and
‘college’ for Set B). Lower-level texts also contain fewer uncommon (rare) words
and more concrete words than the higher-level texts. The hierarchical
classification achieved classification accuracy of 79% for Set A, 72% for Set B, and 65%
for the combined set A+B. The accuracy improved for Set B and the combined
set (A+B) but decreased slightly for Set A when the hierarchical approach
was used.

Fig. 3 Comparison of classifier accuracies using only readability (FKGL) and readability in combination with
additional linguistic features (FKGL+). (Note that ZeroR uses no features, so the experiment is identical for
FKGL and FKGL+; both bars are included only for ease of comparison)


Discussion 


In order to provide individualized reading support for their students, educators
often rely on readability metrics (National Governors Association Center for Best
Practices 2010; Graesser et al. 2011). While easy to use, these metrics overlook
important aspects of text that relate to a student’s ability to understand and learn
from the text. The current study takes advantage of advances in AI, namely
natural language processing and hierarchical machine learning, to assess text
difficulty. In the context of iSTART, the ability to rapidly and accurately predict
human ratings of text difficulty affords the opportunity to provide adaptive
instruction with a larger corpus of practice items.

These experiments demonstrate that including additional linguistic features of 
text increased classification accuracy as compared to using FKGL alone. This 
supports theoretical predictions that text complexity emerges from deeper aspects 
of language than merely number of syllables and sentence length as well as 
existing studies demonstrating gains of including additional linguistic aspects of 
text beyond traditional metrics (see Collins-Thompson 2014). Thus, it is impor- 
tant to encourage researchers and instructors to look beyond traditional readabil- 
ity when considering what texts and tasks might be best suited for their students. 
The ability to provide better metrics in the same amount of time means that 
instructors can use these indices to make more informed decisions. This also 
means that researchers can offer more accurate feedback in the context of 
automated systems. For example, teachers can upload their own texts into the 
iSTART library, which in turn can be classified regarding difficulty. Asking a
teacher to manually decide if the text they are adding is a level 7, 8, or 9 as
compared to the dozens of other texts in the extant library would be time
consuming for the instructor and lead to inconsistent classifications across


Table 6 Hierarchical classification experiments summary

Experiment  Set A      Set B      Set A + Set B
Run 1       E + (M/H)  M + (H/C)  (E/M) + (H/C)
Run 2       M + (E/H)  H + (M/C)  (E/H) + (M/C)
Run 3       H + (E/M)  C + (M/H)  (E/C) + (M/H)

E: Elementary, M: Middle, H: High, C: College
Table 7 Level-based classification and Final accuracy results for Set A, Set B, and Combined Set

                      Expt.  Partition          Classifier  Accuracy  Final Accuracy
Set A                 Run 1  E + Others         LDA         0.94      0.79^
                             M + H              SVM         0.81
                      Run 2  M + Others         SVM         0.75      0.75
                             E + H              SVM         1.00
                      Run 3  H + Others         SVM         0.83      0.76
                             E + M              SVM         0.90
Set B                 Run 1  M + Others         LDA         0.82      0.72*
                             H + C              LDA         0.86
                      Run 2  H + Others         LDA         0.64      0.62
                             M + C              LDA         0.89
                      Run 3  C + Others         LDA         0.87      0.72*
                             M + H              LDA         0.80
Combined Set (A + B)  Run 1  (E + M) + (H + C)  SVM         0.79      0.64
                             E + M              LDA         0.82
                             H + C              SVM/LDA     0.87
                      Run 2  (E + H) + (M + C)  SVM         0.70      0.65#
                             E + H              SVM         0.93
                             M + C              LDA         0.94
                      Run 3  (E + C) + (M + H)  SVM         0.82      0.62
                             E + C              SVM/LDA     1.00
                             M + H              SVM         0.73

^ Highest accuracy; E: Elementary, M: Middle, H: High
* Highest accuracy; M: Middle, H: High, C: College
# Highest accuracy; E: Elementary, M: Middle, H: High, C: College


instructors. As we continue to improve our text difficulty algorithms, we will be
able to include these texts in the adaptive system and have confidence that 
students are receiving the right type of texts at the right time. The algorithm 
ensures that the difficulty levels assigned in the system remain consistent over 
time without needing to rely on the teacher to evaluate a given text in the 
context of the entire library. These classification approaches allow us to continue 
to grow the iSTART library, including texts added by teachers, while still 
providing an adaptive tutoring environment in which students are presented with 
skill-level appropriate readings (McCarthy et al. in press). 

This research also makes multiple theoretical contributions to the literature. At 
the most basic level, this study provides comparisons of different ML approaches 
when used for the classification of expository texts. Important to this study, we 
assessed benchmark text difficulty ratings by hand, rather than relying on pre- 
existing labels of unknown or inaccurate origins. Such an approach allows for a 
more authentic assessment of text difficulty. These experiments demonstrated 
that including additional linguistic features as produced by NLP tools improved 
ML classification accuracy by more than 10% as compared to simple readability 
metrics. These findings are consistent with previous work using regression 
approaches (e.g., Duran et al. 2007; Graesser et al. 2004) and further emphasize 
the importance of using linguistic features beyond word or syllable ratios when
considering text complexity.

The study also contributed to the literature by introducing hierarchical machine
learning as a means of evaluating text difficulty. When classifying Set B and the
combined data (A+B), the hierarchical classification approach was more accurate,
whereas the classifier was able to classify text difficulty better for Set A when using
non-hierarchical classification. The differences in the accuracy of these approaches
suggest that there are potential differences in the nature of the texts in each text set. As
seen in Table 3, the differences between the ‘middle’ and ‘high’ text sets were reduced
when the sets were combined. The Set B texts are scientific texts appropriate for high
school and college students, whereas the Set A texts were designed to include a broader
range of reading skills and topics. Given that the two sets were developed for different
purposes and scored independently of one another, it is appropriate to assume that they
are not perfectly comparable, which is typical given the variety of educational
situations and pedagogical objectives. However, one drawback of hierarchical
classification is that the approach is more resource intensive. Several ML models (the
same or different for each level) must be run on each text, both when testing the models
in order to classify unseen texts and when training any new models.


Limitations and Future Directions 


A benefit of this study was that we used a pre-existing corpus, which provides ecological
validity for our approach. That is, these models are built upon making decisions about
real-world text sets that may have varying numbers of instances of particular cases.
However, a resulting limitation of this choice of corpus is that the set is relatively small
and unbalanced. Future work should examine the utility of these approaches using
larger corpora to examine the generalizability of our findings. We did not find any
study that used word embeddings for classifying text difficulty except one by Jiang et al.
(2018), although word embeddings have been used successfully for several other NLP
tasks such as text classification, text summarization, and sentiment analysis. Therefore,
we also plan to use word embeddings in future work and compare their performance
with that of the other approaches used. We also plan to explore how ordinal logistic
regression would perform compared to the hierarchical classification approach.

As a result of these findings, separate algorithms to classify text difficulty will be 
implemented in iSTART depending on the module and target population. These 
experiments demonstrate the utility of hierarchical classification approaches for 
predicting text difficulty. However, these findings also highlight that this approach 
was not universally more accurate than a flat classification. As such, these findings 
highlight that improving the accuracy and efficacy of educational technologies may 
require relying on different approaches depending on the specific aspect of the target 
texts and population. 
Appendix A


Table 8 The evaluation Metrics (Set A) using FKGL

Machine Learning Algorithm       Features  Overall Accuracy  Kappa
ZeroR                            FKGL      0.38              0.0
AdaBoost (RandomForest)          FKGL      0.61              0.41
BayesNet                         FKGL      0.64              0.45
LogitBoost (Decision Stump)      FKGL      0.66              0.49
Neural network                   FKGL      0.67              0.50
SMO (puk kernel)                 FKGL      0.68              0.51
Logistic                         FKGL      0.70              0.54
MultiClassClassifier (Logistic)  FKGL      0.70              0.54
Bagging (HoeffdingTree)          FKGL      0.70              0.55
Naive Bayes                      FKGL      0.71              0.56
AdaBoost (Hoeffding Tree)        FKGL      0.71              0.56
Trees (Hoeffding)                FKGL      0.71              0.56
LDA                              FKGL      0.71              0.56

Class (for each algorithm): Easy, Medium, Difficult
Table 9 The evaluation Metrics (Set B) using FKGL

Machine Learning Algorithm             Features  Precision  Recall  F-measure  Kappa  Class
ZeroR                                  FKGL      0.00       0.00    0.00       0.0    Medium
                                                 0.52       1.00    0.684             Difficult
                                                 0.00       0.00    0.00              Very Difficult
Trees (Random Forest)                  FKGL      0.31       0.32    0.31       0.12   Medium
                                                 0.55       0.52    0.54              Difficult
                                                 0.40       0.44    0.42              Very Difficult
AdaBoost (RandomForest)                FKGL      0.31       0.32    0.31       0.12   Medium
                                                 0.55       0.52    0.54              Difficult
                                                 0.40       0.44    0.42              Very Difficult
MultiClassClassifier (RandomForest)    FKGL      0.31       0.32    0.31       0.12   Medium
                                                 0.55       0.52    0.54              Difficult
                                                 0.40       0.44    0.42              Very Difficult
Bagging (J48)                          FKGL      0.27       0.12    0.17       0.19   Medium
                                                 0.58       0.71    0.64              Difficult
                                                 0.52       0.57    0.54              Very Difficult
BayesNet                               FKGL      0.00       0.00    0.00       0.18   Medium
                                                 0.58       0.85    0.69              Difficult
                                                 0.46       0.48    0.47              Very Difficult
LogitBoost (Decision Stump)            FKGL      0.36       0.20    0.26       0.25   Medium
                                                 0.60       0.73    0.66              Difficult
                                                 0.57       0.57    0.57              Very Difficult
SMO (puk kernel)                       FKGL      0.00       0.00    0.00       0.18   Medium
                                                 0.56       0.90    0.69              Difficult
                                                 0.63       0.44    0.51              Very Difficult
Trees (Hoeffding)                      FKGL      0.00       0.00    0.00       0.23   Medium
                                                 0.57       0.90    0.70              Difficult
                                                 0.67       0.52    0.59              Very Difficult
AdaBoost (Hoeffding Tree)              FKGL      0.00       0.00    0.00       0.24   Medium
                                                 0.58       0.89    0.68              Difficult
                                                 0.65       0.57    0.61              Very Difficult
Naive Bayes                            FKGL      0.00       0.00    0.00       0.24   Medium
                                                 0.58       0.89    0.70              Difficult
                                                 0.65       0.57    0.61              Very Difficult
Neural network                         FKGL      0.00       0.00    0.00       0.25   Medium
                                                 0.58       0.87    0.70              Difficult
                                                 0.61       0.61    0.61              Very Difficult
Logistic                               FKGL      0.00       0.00    0.00       0.25   Medium
                                                 0.58       0.90    0.71              Difficult
                                                 0.68       0.57    0.62              Very Difficult
LDA                                    FKGL      0.00       0.00    0.00       0.24   Medium
                                                 0.58       0.89    0.70              Difficult
                                                 0.65       0.57    0.61              Very Difficult

Overall Accuracy (column values as listed): 0.52, 0.45, 0.55, 0.56, 0.57, 0.59, 0.59, 0.60, 0.59
Table 10 The evaluation Metrics (Set A+ Set B) using FKGL

Machine Learning Algorithm             Features  Precision  Recall  F-measure  Overall Accuracy  Kappa  Class
ZeroR                                  FKGL      0.00       0.00    0.00       0.42              0.0    Easy
                                                 0.00       0.00    0.00                                Medium
                                                 0.42       1.00    0.59                                Difficult
                                                 0.00       0.00    0.00                                Very Difficult
Bagging (RandomForest)                 FKGL      0.46       0.43    0.44       0.45              0.19   Easy
                                                 0.42       0.43    0.42                                Medium
                                                 0.52       0.54    0.53                                Difficult
                                                 0.21       0.17    0.19                                Very Difficult
MultiClassClassifier (Hoeffding Tree)  FKGL      0.64       0.43    0.51       0.47              0.19   Easy
                                                 0.42       0.54    0.47                                Medium
                                                 0.49       0.49    0.49                                Difficult
                                                 0.25       0.13    0.17                                Very Difficult
AdaBoost (Random Forest)               FKGL      0.51       0.52    0.52       0.47              0.21   Easy
                                                 0.43       0.44    0.43                                Medium
                                                 0.53       0.53    0.53                                Difficult
                                                 0.19       0.17    0.18                                Very Difficult
Trees (Hoeffding)                      FKGL      0.61       0.64    0.63       0.50              0.26   Easy
                                                 0.43       0.52    0.47                                Medium
                                                 0.56       0.52    0.54                                Difficult
                                                 0.25       0.13    0.17                                Very Difficult
AdaBoost (HoeffdingTree)               FKGL      0.61       0.64    0.63       0.50              0.26   Easy
                                                 0.43       0.52    0.47                                Medium
                                                 0.56       0.52    0.54                                Difficult
                                                 0.25       0.13    0.17                                Very Difficult
BayesNet                               FKGL      0.77       0.48    0.59       0.53              0.26   Easy
                                                 0.47       0.39    0.43                                Medium
                                                 0.52       0.78    0.63                                Difficult
                                                 0.00       0.00    0.00                                Very Difficult
Neural network                         FKGL      0.67       0.67    0.67       0.53              0.28   Easy
                                                 0.46       0.47    0.46                                Medium
                                                 0.54       0.64    0.58                                Difficult
                                                 0.00       0.00    0.00                                Very Difficult
LogitBoost (REPTree)                   FKGL      0.67       0.62    0.64       0.54              0.29   Easy
                                                 0.45       0.38    0.41                                Medium
                                                 0.55       0.70    0.61                                Difficult
                                                 0.62       0.22    0.32                                Very Difficult
Naive Bayes                            FKGL      0.72       0.67    0.69       0.56              0.32   Easy
                                                 0.49       0.51    0.50                                Medium
                                                 0.57       0.67    0.61                                Difficult
                                                 0.00       0.00    0.00                                Very Difficult
Logistic                               FKGL      0.72       0.67    0.69       0.56              0.32   Easy
                                                 0.49       0.43    0.45                                Medium
                                                 0.55       0.72    0.63                                Difficult
                                                 0.50       0.09    0.15                                Very Difficult
SMO (puk kernel)                       FKGL      0.74       0.62    0.68       0.56              0.33   Easy
                                                 0.49       0.53    0.51                                Medium
                                                 0.57       0.69    0.63                                Difficult
                                                 0.00       0.00    0.00                                Very Difficult
LDA                                    FKGL      0.72       0.67    0.69       0.55              0.31   Easy
                                                 0.50       0.41    0.45                                Medium
                                                 0.54       0.73    0.62                                Difficult
                                                 0.33       0.04    0.07                                Very Difficult


Table 11 The evaluation Metrics for Set A using FKGL+

Machine Learning Algorithm             Features  Precision  Recall  F-measure  Overall Accuracy  Kappa  Class
ZeroR                                  FKGL+     0.38       1.00    0.55       0.38              0.0    Easy
                                                 0.00       0.00    0.00                                Medium
                                                 0.00       0.00    0.00                                Difficult
AdaBoost (RandomForest)                FKGL+     0.90       0.88    0.89       0.76              0.64   Easy
                                                 0.68       0.73    0.70                                Medium
                                                 0.76       0.72    0.74                                Difficult
LogitBoost (Decision Stump)            FKGL+     0.86       0.86    0.86       0.75              0.63   Easy
                                                 0.69       0.63    0.67                                Medium
                                                 0.74       0.79    0.77                                Difficult
SMO (poly kernel)                      FKGL+     0.92       0.83    0.88       0.80              0.70   Easy
                                                 0.71       0.81    0.76                                Medium
                                                 0.83       0.78    0.80                                Difficult
Logistic                               FKGL+     0.88       0.86    0.87       0.82              0.73   Easy
                                                 0.77       0.76    0.76                                Medium
                                                 0.83       0.86    0.85                                Difficult
MultiClassClassifier (Logistic)        FKGL+     0.88       0.88    0.88       0.79              0.69   Easy
                                                 0.75       0.69    0.72                                Medium
                                                 0.78       0.85    0.81                                Difficult
Bagging (HoeffdingTree)                FKGL+     0.92       0.81    0.86       0.77              0.65   Easy
                                                 0.69       0.79    0.73                                Medium
                                                 0.78       0.74    0.76                                Difficult
Naive Bayes                            FKGL+     0.92       0.81    0.86       0.77              0.65   Easy
                                                 0.69       0.76    0.72                                Medium
                                                 0.77       0.76    0.77                                Difficult
LDA                                    FKGL+     0.92       0.86    0.89       0.81              0.72   Easy
                                                 0.75       0.77    0.76                                Medium
                                                 0.81       0.83    0.82                                Difficult
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Table 12 The evaluation Metrics for Set B using FKGL+ 


Machine Learning Algorithm Features Precision Recall F-measure Overall Accuracy Kappa Class 
ZeroR FKGL+ 0.00 0.00 0.00 0.52 0.0 Medium 
0.52 1.00 0.684 Difficult 
0.00 0.00 0.00 Very Difficult 
AdaBoost (RandomForest) FKGL+ 0.56 0.36 0.44 0.60 0.30 Medium 
0.59 0.75 0.66 Difficult 
0.67 0.52 0.59 Very Difficult 
MultiClassClassifier (Logistic) FKGL+ 0.52 0.48 0.50 0.63 0.39 Medium 
0.65 0.67 0.66 Difficult 
0.70 0.70 0.70 Very Difficult 
Bagging (HoeffdingTree) FKGL+ 0.53 0.36 0.43 0.66 0.42 Medium 
0.66 0.77 0.71 Difficult 
0.77 0.74 0.76 Very Difficult 
LogitBoost (RandomTree) FKGL+ 0.53 0.36 0.43 0.63 0.37 Medium 
0.63 0.75 0.68 Difficult 
0.71 0.65 0.68 Very Difficult 
SMO (poly kernel) FKGL+ 0.00 0.00 0.00 0.63 0.30 Medium 
0.59 0.94 0.73 Difficult 
0.50 0.63 0.54 Very Difficult 
Naive Bayes FKGL+ 0.38 0.32 0.35 0.57 0.29 Medium 
0.60 0.64 0.62 Difficult 
0.67 0.70 0.68 Very Difficult 
Logistic FKGL+ 0.52 0.44 0.48 0.63 0.38 Medium 
0.64 0.71 0.67 Difficult 
0.71 0.65 0.68 Very Difficult 
LDA FKGL+ 0.52 0.48 0.50 0.64 0.41 Medium 
0.67 0.67 0.67 Difficult 
0.68 0.74 0.71 Very Difficult 

Table 13 The evaluation metrics for Set A + Set B using FKGL+ 

ML Algorithm Features Precision Recall F-measure Overall Accuracy Kappa Class 
ZeroR FKGL+ 0.00 0.00 0.00 0.42 0.0 Easy 
0.00 0.00 0.00 Medium 
0.42 1.00 0.59 Difficult 
0.00 0.00 0.00 Very Difficult 
Bagging (RandomForest) FKGL+ 0.78 0.76 0.77 0.68 0.52 Easy 
0.63 0.66 0.64 Medium 
0.69 0.76 0.72 Difficult 
0.78 0.30 0.44 Very Difficult 
MultiClassClassifier (Logistic) FKGL+ 0.69 0.69 0.69 0.60 0.40 Easy 
0.53 0.45 0.49 Medium 
0.60 0.73 0.66 Difficult 
0.77 0.44 0.56 Very Difficult 
AdaBoost (RandomForest) FKGL+ 0.79 0.71 0.75 0.64 0.47 Easy 
0.58 0.67 0.62 Medium 
0.66 0.67 0.66 Difficult 
0.64 0.30 0.41 Very Difficult 
LogitBoost (DecisionStump) FKGL+ 0.78 0.69 0.73 0.64 0.46 Easy 
0.61 0.62 0.62 Medium 
0.64 0.71 0.67 Difficult 
0.47 0.30 0.37 Very Difficult 
Naive Bayes FKGL+ 0.58 0.76 0.66 0.58 0.40 Easy 
0.54 0.52 0.53 Medium 
0.63 0.62 0.62 Difficult 
0.60 0.39 0.47 Very Difficult 
Logistic FKGL+ 0.68 0.76 0.72 0.62 0.45 Easy 
0.59 0.52 0.55 Medium 
0.63 0.68 0.65 Difficult 
0.68 0.57 0.62 Very Difficult 
SMO (puk kernel) FKGL+ 0.75 0.79 0.77 0.64 0.46 Easy 
0.60 0.54 0.57 Medium 
0.63 0.75 0.68 Difficult 
0.75 0.26 0.39 Very Difficult 
LDA FKGL+ 0.59 0.62 0.61 0.61 0.43 Easy 
0.57 0.52 0.54 Medium 
0.65 0.70 0.68 Difficult 
0.62 0.57 0.59 Very Difficult 

Appendix B 

Table 14 Machine Learning algorithms used in the study (Weka 3.8.1) 

BayesNet 
Naive Bayes 
LDA 
Logistic 
SimpleLogistic 
ZeroR 
MultiLayerPerceptron 
SMO (puk kernel, poly kernel, RBFKernel) 
Trees (DecisionStump, HoeffdingTree, J48, LMT, RandomTree, RandomForest, REPTree) 
LogitBoost (DecisionStump, HoeffdingTree, J48, LMT, RandomTree, RandomForest, REPTree) 
AdaBoostM1 (DecisionStump, HoeffdingTree, J48, LMT, RandomTree, RandomForest, REPTree) 
MultiClassClassifier (DecisionStump, HoeffdingTree, J48, LMT, RandomTree, RandomForest, REPTree) 
Bagging (DecisionStump, HoeffdingTree, J48, LMT, RandomTree, RandomForest, REPTree) 

Appendix C 


Table 15 Maximum and Minimum un-normalized values for Linguistic indices (Set A) 

Linguistic Indices Minimum Maximum 
FKGL (Flesch Kincaid Grade Level) 01.88 13.99 
L2 Readability 05.66 38.23 
Syntactic Complexity 0.580 0.820 
Uncommon/rare Words 14.00 317.0 
Lexical Diversity (MTLD; measure of textual lexical diversity) 30.71 130.2 
Familiarity 570.9 597.8 
Imageability 311.9 395.9 
Age of Acquisition (AoA) 04.61 06.28 

Table 16 Maximum and Minimum un-normalized values for Linguistic indices (Set B) 

Linguistic Indices Minimum Maximum 
FKGL (Flesch Kincaid Grade Level) 06.15 14.82 
L2 Readability 03.36 33.87 
Syntactic Complexity 0.610 0.780 
Uncommon/rare Words 18.00 120.0 
Lexical Diversity (MTLD; measure of textual lexical diversity) 29.71 108.9 
Familiarity 572.4 600.1 
Imageability 287.3 365.7 
Age of Acquisition (AoA) 05.07 06.92 

Table 17 Maximum and Minimum un-normalized values for Linguistic indices (Set A + Set B) 

Linguistic Indices Minimum Maximum 
FKGL (Flesch Kincaid Grade Level) 01.88 14.82 
L2 Readability 03.36 38.23 
Syntactic Complexity 0.580 0.820 
Uncommon/rare Words 14.00 317.0 
Lexical Diversity (MTLD; measure of textual lexical diversity) 29.71 130.2 
Familiarity 570.9 600.1 
Imageability 287.3 395.9 
Age of Acquisition (AoA) 04.61 06.92 


Appendix D 


Table 18 Text Difficulty Level-wise Performance metrics 

Set Class Sensitivity Specificity PPV NPV Final Accuracy Kappa 
Set A Easy 0.75 1.00 1.00 0.92 0.79 0.68 
Medium 0.92 0.72 0.67 0.93 
Difficult 0.70 0.95 0.89 0.84 
Set B Medium 0.40 1.00 1.00 0.83 0.72 0.52 
Difficult 0.85 0.58 0.68 0.79 
Very Difficult 0.78 0.90 0.70 0.93 
Combined Set (A+B) Easy 0.83 0.92 0.67 0.97 0.65 0.47 
Medium 0.54 0.84 0.64 0.78 
Difficult 0.73 0.70 0.65 0.78 
Very Difficult 0.33 0.99 0.67 0.95 


Table 19 Confusion Matrix for Set A 

Prediction    Reference 
              Easy    Medium    Difficult 
Easy          12      0 
Medium                22 
Difficult             2 


Table 20 Confusion Matrix for Set B 


Prediction Reference 
Medium Difficult Very Difficult 
Medium 4 0 
Difficult 6 17 
Very Difficult 0 3 


Table 21 Confusion Matrix for the Combined Set (A + B) 


Prediction Reference 

Easy Medium Difficult Very Difficult 
Easy 10 4 1 0 
Medium 0 14 7 1 
Difficult 2 8 24 3 
Very Difficult 0 0 1 2 
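
The combined-set values in Table 18 can be recomputed from the confusion matrix in Table 21. The following is a minimal sketch in Python, not the Weka procedure used in the study. The function names (`class_metrics`, `accuracy`, `cohen_kappa`) are illustrative, and the sketch assumes that rows are predicted classes and columns are reference classes, as laid out in Table 21.

```python
# Combined-set confusion matrix copied from Table 21.
# Assumption: rows = predicted class, columns = reference (human-rated) class.
LABELS = ["Easy", "Medium", "Difficult", "Very Difficult"]
MATRIX = [
    [10, 4, 1, 0],
    [0, 14, 7, 1],
    [2, 8, 24, 3],
    [0, 0, 1, 2],
]


def class_metrics(matrix, k):
    """One-vs-rest sensitivity, specificity, PPV and NPV for class index k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fp = sum(matrix[k]) - tp                  # predicted k, reference differs
    fn = sum(row[k] for row in matrix) - tp   # reference k, predicted differs
    tn = total - tp - fp - fn
    # No zero-division guard: every denominator is non-zero for this matrix.
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def accuracy(matrix):
    """Share of texts on the diagonal (prediction equals reference)."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def cohen_kappa(matrix):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


for index, label in enumerate(LABELS):
    values = class_metrics(MATRIX, index)
    print(label, {name: round(value, 2) for name, value in values.items()})
print("accuracy", round(accuracy(MATRIX), 2), "kappa", round(cohen_kappa(MATRIX), 2))
```

With this matrix, 50 of the 77 texts fall on the diagonal, which gives the final accuracy of 0.65 and the kappa of 0.47 reported for the combined set in Table 18.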